Notes from video instructions
Expectation is that we explore the data. The introduction lays out a much more robust background.
Methods section 1 is where we introduce some exploratory analysis: correlation analysis, scatterplots, collinearity.
Methods section 2 is where we take our exploratory analysis and build some models; maybe we stop at additive, maybe we go from additive to interactive. Analyze RMSE, adjusted R-squared, residual standard error. No matter what, we then look for ways to improve our additive model by shrinking it intelligently or growing it.
Choose a couple of models for a table: model, AIC/BIC, number of variables, RMSE, some stats.
The goal of the project is to explore the data, not to make the best model.
The best way to fail is to pick a path and just stick to it. Don't put the blinders on.
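The comparison table described in the notes could be assembled along these lines; a sketch using two toy models fit to R's built-in mtcars data rather than our housing models (the rmse() helper is our own, not from a package):

```r
# Sketch of the model-comparison table: model name, AIC/BIC, number of
# predictors, RMSE, and adjusted R^2, for two toy models on mtcars.
m1 = lm(mpg ~ wt, data = mtcars)
m2 = lm(mpg ~ wt + hp, data = mtcars)
rmse = function(m) sqrt(mean(resid(m) ^ 2))  # training RMSE helper
data.frame(
  model    = c("mpg ~ wt", "mpg ~ wt + hp"),
  AIC      = c(AIC(m1), AIC(m2)),
  BIC      = c(BIC(m1), BIC(m2)),
  n_vars   = c(1, 2),
  RMSE     = c(rmse(m1), rmse(m2)),
  adj_r_sq = c(summary(m1)$adj.r.squared, summary(m2)$adj.r.squared))
```

For our project the rows would be the candidate housing models instead.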
The data we have chosen to look at is housing prices in California. This data comes from Kaggle and contains the measurements that would go into predicting the price of a house in California. As people who currently rent (one of us living in California), we hope to one day purchase a home, and understanding this model could help us determine the important factors in predicting price and whether homes we intend to buy are a good deal or not.
In this document we will modify and assess the data, then use our assessments to build a good model that neither overfits nor underfits the data. We hope to create a general model that, given some factors about a property, outputs an expected price.
Original variables:
Median_House_Value: Median house value for households within a block (measured in US Dollars) [$]
Median_Income: Median income for households within a block of houses (measured in tens of thousands of US Dollars) [10k$]
Median_Age: Median age of a house within a block; a lower number is a newer building [years]
Total_Rooms: Total number of rooms within a block
Total_Bedrooms: Total number of bedrooms within a block
Population: Total number of people residing within a block
Households: Total number of households, a group of people residing within a home unit, for a block
Latitude: A measure of how far north a house is; a higher value is farther north [°]
Longitude: A measure of how far west a house is; a higher value is farther west [°]
Distance_to_coast: Distance to the nearest coast point [m]
Distance_to_Los_Angeles: Distance to the center of Los Angeles [m]
Distance_to_San_Diego: Distance to the center of San Diego [m]
Distance_to_San_Jose: Distance to the center of San Jose [m]
Distance_to_San_Francisco: Distance to the center of San Francisco [m]
Variables added by the group later in the project:
dist_to_nearest_city: The numeric minimum of variables 11 through 14, divided by 1000 to convert to km [km]
nearest_city: Categorical variable indicating which of the cities in variables 11 through 14 was the closest [city name]
near_a_city_100: A factor variable indicating whether a house is within 100 km of the nearest city [TRUE, FALSE]
near_a_city_200: A factor variable indicating whether a house is within 200 km of the nearest city [TRUE, FALSE]
First we need to load in the data and prepare some of the columns. (read.csv() is in base R, so no extra packages are needed here.)
housing_data = read.csv("California_Houses.csv")
To augment the data a bit, we take the predictors Distance_to_LA, Distance_to_SanDiego, Distance_to_SanJose, and Distance_to_SanFrancisco and convert them into two new columns: - a factor variable nearest_city - This segments the data into regions of California and lets us check whether there is any relevance in being closer to one city vs. another. - a numeric variable dist_to_nearest_city - This gives us the distance to this nearest city in km.
nearest_city = rep("", nrow(housing_data))
dist_to_nearest = rep(0, nrow(housing_data))
near_city = rep(0, nrow(housing_data))
nearest_city_options = c("LA", "San Diego", "San Jose", "San Fransisco")
for (i in 1:nrow(housing_data)) {
  subset = housing_data[i, c("Distance_to_LA",
                             "Distance_to_SanDiego",
                             "Distance_to_SanJose",
                             "Distance_to_SanFrancisco")]
  nearest_city[i] = nearest_city_options[which.min(subset)]
  dist_to_nearest[i] = min(subset) / 1000  # meters -> km
  # near_city[i] = ifelse(dist_to_nearest[i] < 100, 1, 0)
}
housing_data$nearest_city = as.factor(nearest_city)
housing_data$dist_to_nearest_city = dist_to_nearest
# housing_data$near_a_city = as.factor(near_city)
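The loop above could also be written without explicit row iteration; a sketch on a small synthetic distance table (the column and city names mirror the real dataset, including its "Fransisco" spelling):

```r
# Two fake blocks: one right next to LA, one closest to San Jose (meters).
dists = data.frame(
  Distance_to_LA           = c(1000, 500000),
  Distance_to_SanDiego     = c(200000, 600000),
  Distance_to_SanJose      = c(480000, 2000),
  Distance_to_SanFrancisco = c(500000, 70000))
city_names = c("LA", "San Diego", "San Jose", "San Fransisco")
# apply over rows: index of the smallest distance picks the city name
nearest_city = factor(city_names[apply(dists, 1, which.min)])
dist_to_nearest_city = apply(dists, 1, min) / 1000  # meters -> km
```

Running the same two apply() calls on the four distance columns of housing_data would reproduce the loop's result.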
We will then perform a quick assessment of the variables we just created. First we will inspect what percentage of the properties are closest to each city.
data.frame(
Los_Angeles = mean(housing_data$nearest_city == "LA"),
San_Diego = mean(housing_data$nearest_city == "San Diego"),
San_Jose = mean(housing_data$nearest_city == "San Jose"),
San_Fransisco = mean(housing_data$nearest_city == "San Fransisco"))
## Los_Angeles San_Diego San_Jose San_Fransisco
## 1 0.4759 0.09685 0.1824 0.2449
And overall numbers for nearest city:
summary(housing_data$nearest_city)
## LA San Diego San Fransisco San Jose
## 9823 1999 5054 3764
We will also run a quick sanity check that ensures that based on latitude and longitude we do indeed have the correct nearest city.
plot(Latitude ~ Longitude, housing_data,
col = nearest_city,
pch = as.numeric(nearest_city))
legend("topright",
legend = levels(housing_data$nearest_city),
col = c(1:4),
pch = c(1:4))
Next we will gather some data on how far data points are from the nearest city.
summary(housing_data$dist_to_nearest_city)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4 17.2 36.1 69.5 93.1 489.6
And we will plot the distribution of distances as both a box plot and a histogram to help frame these distances.
par(mfrow = c(1,2))
boxplot(housing_data$dist_to_nearest_city,
ylab = "Distance to Nearest City [km]",
main = "Boxplot of Distances to Nearest\n City")
hist(housing_data$dist_to_nearest_city,
     xlab = "Distance to Nearest City [km]",
     main = "Histogram of Distances to Nearest City")
Between the summary information and the boxplot, we can see that more than 3/4 of the properties in the dataset are within 100 km of the nearest city. We will use this to create a new variable called near_a_city_100, a factor variable which evaluates to TRUE if the property is within 100 km of a city and FALSE otherwise.
housing_data$near_a_city_100 = as.factor(housing_data$dist_to_nearest_city < 100)
We also see that the whisker on the boxplot above ends around the 200 km mark, so we will investigate what proportion of the data falls within 200 km.
mean(housing_data$dist_to_nearest_city < 200)
## [1] 0.9161
Given that more than 91% of the data falls within 200 km of a city, we will also create a factor variable for this cutoff.
housing_data$near_a_city_200 = as.factor(housing_data$dist_to_nearest_city < 200)
Our hope here is that one of the two factor variables created will be a sufficient demarcation line to where certain variables start to have differing effects on the response when we build a model. We will explore that further later.
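A quick way to check whether such a demarcation line matters is to compare the mean response on either side of it; a sketch on toy data (the same tapply() call on housing_data with near_a_city_100 and Median_House_Value would give the real comparison):

```r
# Toy data: two "near" blocks and two "far" blocks with made-up values.
toy = data.frame(
  value   = c(300, 320, 150, 140),
  near100 = factor(c(TRUE, TRUE, FALSE, FALSE)))
# mean value within vs. outside the cutoff, one entry per factor level
tapply(toy$value, toy$near100, mean)
```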
mean(as.numeric(housing_data$near_a_city_100) - 1)
## [1] 0.7664
76.64% of data points are within 100 km of the center of the nearest city.
mean(as.numeric(housing_data$near_a_city_200) - 1)
## [1] 0.9161
91.61% of data points are within 200 km of the center of the nearest city.
Having harvested what we need from the distance-to-each-city variables, we will now eliminate them from the dataset in order to make plotting and analysis more manageable.
housing_data = subset(housing_data,
select = -c(Distance_to_LA,
Distance_to_SanFrancisco,
Distance_to_SanDiego,
Distance_to_SanJose))
data.frame(name = names(housing_data))
## name
## 1 Median_House_Value
## 2 Median_Income
## 3 Median_Age
## 4 Tot_Rooms
## 5 Tot_Bedrooms
## 6 Population
## 7 Households
## 8 Latitude
## 9 Longitude
## 10 Distance_to_coast
## 11 nearest_city
## 12 dist_to_nearest_city
## 13 near_a_city_100
## 14 near_a_city_200
With the data loaded and prepped, we want to start building the model. Before we do that, we want to check the pairs of all the different predictor variables to see which predictors have strong correlations. We will leave out Latitude and Longitude, and represent nearest_city as a color.
plot(housing_data[ , c(2:7, 10,12)],
col = as.numeric(housing_data$nearest_city),
main = "Plot of Every Variable vs Every Other Variable in Housing Data (some withheld)")
A couple of obvious collinearities jump out.
Tot_Rooms - Tot_Bedrooms
cor(housing_data$Tot_Rooms, housing_data$Tot_Bedrooms)
## [1] 0.9299
Tot_Rooms - Population
cor(housing_data$Tot_Rooms, housing_data$Population)
## [1] 0.8571
Tot_Rooms - Households
cor(housing_data$Tot_Rooms, housing_data$Households)
## [1] 0.9185
Tot_Bedrooms - Population
cor(housing_data$Tot_Bedrooms, housing_data$Population)
## [1] 0.878
Tot_Bedrooms - Households
cor(housing_data$Tot_Bedrooms, housing_data$Households)
## [1] 0.9798
Population - Households
cor(housing_data$Population, housing_data$Households)
## [1] 0.9072
It appears that the variables that have to do with density show strong positive correlation. This makes sense: as the total number of people within a block (Population) increases, you also see an increase in total households within a block (Households), which in turn comes with an increase in both total bedrooms (Tot_Bedrooms) and total rooms (Tot_Rooms). We are not attempting to demonstrate causation, simply that the density indicators are linked to each other.
These variables may still have interactions that we will explore later. For instance, an area where the population density is low but the number of rooms is high, or where the number of households is low but the number of rooms is high, may indicate an increase in house value. We will keep this in mind for later.
We will also explore variable correlations within the context of whether properties are close to or far away from a city, to check whether any patterns emerge within either group that were otherwise hidden.
str(housing_data)
## 'data.frame': 20640 obs. of 14 variables:
## $ Median_House_Value : num 452600 358500 352100 341300 342200 ...
## $ Median_Income : num 8.33 8.3 7.26 5.64 3.85 ...
## $ Median_Age : int 41 21 52 52 52 52 52 52 42 52 ...
## $ Tot_Rooms : int 880 7099 1467 1274 1627 919 2535 3104 2555 3549 ...
## $ Tot_Bedrooms : int 129 1106 190 235 280 213 489 687 665 707 ...
## $ Population : int 322 2401 496 558 565 413 1094 1157 1206 1551 ...
## $ Households : int 126 1138 177 219 259 193 514 647 595 714 ...
## $ Latitude : num 37.9 37.9 37.9 37.9 37.9 ...
## $ Longitude : num -122 -122 -122 -122 -122 ...
## $ Distance_to_coast : num 9263 10226 8259 7768 7768 ...
## $ nearest_city : Factor w/ 4 levels "LA","San Diego",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ dist_to_nearest_city: num 21.3 20.9 18.8 18 18 ...
## $ near_a_city_100 : Factor w/ 2 levels "FALSE","TRUE": 2 2 2 2 2 2 2 2 2 2 ...
## $ near_a_city_200 : Factor w/ 2 levels "FALSE","TRUE": 2 2 2 2 2 2 2 2 2 2 ...
city = housing_data$near_a_city_100 == TRUE
non_city = ! city
plot(housing_data[ city, c(2:7, 10,12)],
main = "City Based Plots",
col = "darkgray")
plot(housing_data[ non_city, c(2:7, 10,12)],
main = "Non-City Based Plots",
col = "darkblue")
No further trends obviously emerge from breaking out the data into city vs non-city.
Before we move on, we will attempt to see if any of the cities have collinearities specific to their locality. To do this we will repeat the previous step once for each city using only that city's data, in hopes of isolating information specific to the major cities, which cover most of the data.
levels(housing_data$nearest_city)
## [1] "LA" "San Diego" "San Fransisco" "San Jose"
la = housing_data$nearest_city == "LA" & city
sd = housing_data$nearest_city == "San Diego" & city
sf = housing_data$nearest_city == "San Fransisco" & city
sj = housing_data$nearest_city == "San Jose" & city
plot(housing_data[la, c(2:7, 10,12)],
main = "Los Angeles Based Plots",
col = 1)
plot(housing_data[sd, c(2:7, 10,12)],
main = "San Diego Based Plots",
col = 2)
plot(housing_data[sf, c(2:7, 10,12)],
main = "San Fransisco Based Plots",
col = 3)
plot(housing_data[sj, c(2:7, 10,12)],
main = "San Jose Based Plots",
col = 4)
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
# Commented out because the column indices no longer match after the columns were modified.
# ggpairs(housing_data,
# columns = c(1, 2:5), # Columns
# aes(color = nearest_city, # Color by group (cat. variable)
# alpha = 0.5))
# ggpairs(housing_data,
# columns = c(1, 6:9), # Columns
# aes(color = nearest_city, # Color by group (cat. variable)
# alpha = 0.5))
# ggpairs(housing_data,
# columns = c(1, 10:13), # Columns
# aes(color = nearest_city, # Color by group (cat. variable)
# alpha = 0.5))
# ggpairs(housing_data,
# columns = c(1, 14:15), # Columns
# aes(color = nearest_city, # Color by group (cat. variable)
# alpha = 0.5))
It looks like median age has little to do with predicting the house value, so removing it would reduce our model size. However, we will hold off on cutting Median_Age for now; based on what we understand about real estate and tax brackets, it may be relevant later, though it could also prove a red herring.
The last graph we will create for assistance is a graph of Median_House_Value vs. each of the other numeric predictors. We will leave the graphs somewhat sparse to allow for a quick visual scan of each relationship.
par(mfrow = c(3,3))
plot(Median_House_Value ~ Median_Income, housing_data, col = 1)
plot(Median_House_Value ~ Median_Age, housing_data, col = 2)
plot(Median_House_Value ~ Tot_Rooms, housing_data, col = 3)
plot(Median_House_Value ~ Tot_Bedrooms, housing_data, col = 4)
plot(Median_House_Value ~ Population, housing_data, col = 5)
plot(Median_House_Value ~ Households, housing_data, col = 6)
plot(Median_House_Value ~ Latitude, housing_data, col = 7)
plot(Median_House_Value ~ Distance_to_coast, housing_data, col = 8)
plot(Median_House_Value ~ dist_to_nearest_city, housing_data, col = 9)
# housing_data = subset(housing_data,select = -c(Median_Age))
set.seed(420)
housing_data_idx = sample(nrow(housing_data), size = trunc(0.80 * nrow(housing_data)))
housing_data_trn = housing_data[housing_data_idx, ]
housing_data_tst = housing_data[-housing_data_idx, ]
add_model = lm(Median_House_Value ~ ., data = housing_data_trn)
int_model = lm(Median_House_Value ~ (.) ^ 2, data = housing_data_trn)
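With a train/test split in place, the test RMSE mentioned in the notes can be computed with a small helper; a sketch on toy data (rmse() is our own function, not from a package; on the real data, the held-out set would be housing_data_tst):

```r
# Root mean squared error between actual and predicted values.
rmse = function(actual, predicted) sqrt(mean((actual - predicted) ^ 2))

# Toy demonstration: fit on 40 points, evaluate on the 10 held out.
set.seed(42)
toy = data.frame(x = runif(50))
toy$y = 3 * toy$x + rnorm(50, sd = 0.1)
trn_idx = sample(nrow(toy), 40)
fit = lm(y ~ x, data = toy[trn_idx, ])
rmse(toy$y[-trn_idx], predict(fit, toy[-trn_idx, ]))
```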
library(faraway)
##
## Attaching package: 'faraway'
## The following object is masked from 'package:GGally':
##
## happy
vif(add_model)
## Median_Income Median_Age Tot_Rooms
## 1.841 1.439 13.198
## Tot_Bedrooms Population Households
## 39.734 6.505 40.066
## Latitude Longitude Distance_to_coast
## 37.838 28.112 5.389
## nearest_citySan Diego nearest_citySan Fransisco nearest_citySan Jose
## 1.862 17.038 8.326
## dist_to_nearest_city near_a_city_100TRUE near_a_city_200TRUE
## 9.131 4.521 2.905
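Several of the VIFs above are well past the common rule-of-thumb cutoff of 5 (Tot_Bedrooms and Households are near 40). As a reminder of what the variance inflation factor measures, here it is computed by hand on toy collinear data: 1 / (1 - R^2) from regressing each predictor on all the others.

```r
# Toy predictors: x1 and x2 nearly collinear, x3 independent.
set.seed(7)
x1 = rnorm(200)
x2 = x1 + rnorm(200, sd = 0.1)
x3 = rnorm(200)
X = data.frame(x1, x2, x3)
# VIF of column j: regress it on the remaining columns, then 1 / (1 - R^2).
vif_by_hand = function(j) 1 / (1 - summary(lm(X[[j]] ~ ., data = X[-j]))$r.squared)
sapply(seq_along(X), vif_by_hand)  # x1 and x2 large, x3 near 1
```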
#mod = step(int_model, direction = "backward", trace = 0)